Chapter 8 Chinese Text Processing

In this chapter, we will discuss one of the most important issues in Chinese language/text processing, i.e., word segmentation. As we saw when discussing tokenization in Chapter ??, word tokenization is relatively easy in English because word boundaries are clearly delimited by whitespace. Chinese, however, is written without whitespace between words, which poses a serious problem for word tokenization.

This chapter is devoted to Chinese text processing. We will look at the issues of word tokenization and introduce the most widely used library for Chinese word segmentation, jiebaR. We will also include several case studies on Chinese text processing.

8.1 Chinese Word Segmenter jiebaR

8.1.1 Start

First, if you haven’t installed the library jiebaR, you may need to install it manually:
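A minimal sketch of the installation step, assuming jiebaR is installed from CRAN. The conditional check is an addition here so that the package is not reinstalled on every run.

```r
# Install jiebaR from CRAN only if it is missing (assumption: CRAN is reachable)
if (!requireNamespace("jiebaR", quietly = TRUE)) {
  install.packages("jiebaR")
}

library(jiebaR)

# Report the installed version
packageVersion("jiebaR")
```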

This is the version used for this tutorial.

## [1] '0.11'

Now let us take a look at a quick example. Let us assume that in our corpus, we have collected only one text document, with only a short paragraph.

There are two important steps in Chinese word segmentation:

  • initialize a word segmenter object using worker()
  • segment the texts using segment()

##  [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指民眾"  
##  [7] "黨"       "不"       "分區"     "被"       "提名"     "人"      
## [13] "蔡壁如"   "黃"       "瀞"       "瑩"       "在昨"     "6"       
## [19] "日"       "才"       "請辭"     "是"       "為領"     "年終獎金"
## [25] "台灣民眾" "黨"       "主席"     "台北"     "市長"     "柯文"    
## [31] "哲"       "7"        "日"       "受訪"     "時則"     "說"      
## [37] "都"       "是"       "按"       "流程"     "走"       "不要"    
## [43] "把"       "人家"     "想得"     "這麼"     "壞"

To segment the document, text, you first initialize a segmenter seg1 using worker() and then pass this segmenter to segment() via the argument jiebar = seg1, which segments text into words.
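The two steps can be sketched as follows. The names text and seg1 follow the description above; the sample paragraph is reconstructed from the tagged output shown later in this chapter, so treat its exact punctuation as an assumption.

```r
library(jiebaR)

# The one-document corpus (reconstructed; punctuation may differ from the original)
text <- "綠黨桃園市議員王浩宇爆料,指民眾黨不分區被提名人蔡壁如、黃瀞瑩,在昨(6)日才請辭是為領年終獎金。台灣民眾黨主席、台北市長柯文哲7日受訪時則說,都是按流程走,不要把人家想得這麼壞。"

# Step 1: initialize a segmenter with the default settings
seg1 <- worker()

# Step 2: segment the text into a character vector of words
words <- segment(text, jiebar = seg1)
words
```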

8.1.2 Settings

There are many different parameters you can specify when you initialize the segmenter worker(). You may get more detail via the documentation ?worker. Some of the important arguments include:

  • user = ...: the path to a user-defined dictionary
  • stop_word = ...: the path to a stopword list
  • symbol = FALSE: whether to return symbols such as punctuation marks (the default is FALSE)
  • bylines = FALSE: whether to return a list with one vector of words per input text (TRUE) or a single vector of words (FALSE, the default)
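As a small illustration of how these arguments change the output, the sketch below compares the default segmenter with one initialized with symbol = TRUE. The example sentence and the object names are hypothetical.

```r
library(jiebaR)

x <- "今天天氣很好,我們去公園散步。"  # hypothetical example sentence

seg_default <- worker()               # symbol = FALSE: punctuation is dropped
seg_symbol  <- worker(symbol = TRUE)  # punctuation is kept as separate tokens

segment(x, seg_default)
segment(x, seg_symbol)
```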

8.1.3 User-defined dictionary

From the above example, it is clear that some of the words are not correctly identified by the current segmenter: for example, 民眾黨, 不分區, 黃瀞瑩, 柯文哲. It is always recommended to include a user-defined dictionary when doing word segmentation because different corpora may have their own unique vocabulary. The dictionary can be specified when you initialize the segmenter using worker().

##  [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指"      
##  [7] "民眾黨"   "不分區"   "被"       "提名"     "人"       "蔡壁如"  
## [13] "黃瀞瑩"   "在昨"     "6"        "日"       "才"       "請辭"    
## [19] "是"       "為領"     "年終獎金" "台灣"     "民眾黨"   "主席"    
## [25] "台北"     "市長"     "柯文哲"   "7"        "日"       "受訪"    
## [31] "時則"     "說"       "都"       "是"       "按"       "流程"    
## [37] "走"       "不要"     "把"       "人家"     "想得"     "這麼"    
## [43] "壞"

The format of the user-defined dictionary is a text file, with one word per line. Also, the default encoding of the dictionary is UTF-8. Please note that in Windows, the default encoding of a txt file created by Notepad may not be UTF-8.
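A minimal sketch of this setup, assuming the dictionary contains the four words mentioned above. The temporary file and the names user_dict and seg2 are hypothetical; in practice you would point user = ... to your own dictionary file.

```r
library(jiebaR)

# Write a small user-defined dictionary: one word per line, UTF-8 encoded
user_dict <- tempfile(fileext = ".txt")
writeLines(enc2utf8(c("民眾黨", "不分區", "黃瀞瑩", "柯文哲")),
           con = user_dict, useBytes = TRUE)

# Initialize a segmenter that consults the user-defined dictionary
seg2 <- worker(user = user_dict)

segment("台灣民眾黨主席、台北市長柯文哲7日受訪", seg2)
```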

Creating a user-defined dictionary may take a lot of time. You may consult 搜狗詞庫, which includes many domain-specific dictionaries created by others. However, these dictionaries are distributed in the .scel format, so you may need to convert them from .scel to .txt before using them in jiebaR. To do the conversion automatically, please consult the library cidian. You may also need to convert between simplified and traditional Chinese; for this, you may consult the library ropencc in R.

8.1.4 Stopwords

When you initialize the segmenter, you can also specify a stopword list, i.e., words you do not need to include in the later analyses. For example, in text mining, function words are usually less informative.
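A minimal sketch, assuming a stopword list that contains 日, 是, and 都 (the tokens that disappear in the output below). The temporary file and the names stop_path and seg3 are hypothetical.

```r
library(jiebaR)

# Write a small stopword list: one word per line, UTF-8 encoded
stop_path <- tempfile(fileext = ".txt")
writeLines(enc2utf8(c("日", "是", "都")), con = stop_path, useBytes = TRUE)

# Initialize a segmenter with the stopword list
seg3 <- worker(stop_word = stop_path)

# Words on the stopword list are expected to be filtered from the result
segment("柯文哲7日受訪時則說,都是按流程走", seg3)
```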

##  [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指民眾"  
##  [7] "黨"       "不"       "分區"     "被"       "提名"     "人"      
## [13] "蔡壁如"   "黃"       "瀞"       "瑩"       "在昨"     "6"       
## [19] "才"       "請辭"     "為領"     "年終獎金" "台灣民眾" "黨"      
## [25] "主席"     "台北"     "市長"     "柯文"     "哲"       "7"       
## [31] "受訪"     "時則"     "說"       "按"       "流程"     "走"      
## [37] "不要"     "把"       "人家"     "想得"     "這麼"     "壞"

8.1.5 POS Tagging

So far we haven’t seen the parts-of-speech tags provided by the word segmenter. If you need the POS tags of the words, you need to specify the argument type = "tag" when you initialize the worker().
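A minimal sketch of a tagging segmenter. The name seg_tag is hypothetical; the result is a named character vector in which the names carry the POS tags, as in the output below.

```r
library(jiebaR)

# Initialize a segmenter that also returns POS tags
seg_tag <- worker(type = "tag")

tagged <- segment("我是在測試一個句子", seg_tag)

tagged         # the words
names(tagged)  # the POS tag of each word
```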

##          n         ns          n          x          n          n          x 
##     "綠黨"   "桃園市"     "議員"   "王浩宇"     "爆料"       "指"   "民眾黨" 
##          x          p          v          n          x          x          x 
##   "不分區"       "被"     "提名"       "人"   "蔡壁如"   "黃瀞瑩"     "在昨" 
##          x          d          v          x          n          x          x 
##        "6"       "才"     "請辭"     "為領" "年終獎金"     "台灣"   "民眾黨" 
##          n         ns          n          x          x          v          x 
##     "主席"     "台北"     "市長"   "柯文哲"        "7"     "受訪"     "時則" 
##         zg          p          n          v         df          p          n 
##       "說"       "按"     "流程"       "走"     "不要"       "把"     "人家" 
##          x          r          a 
##     "想得"     "這麼"       "壞"

The following table lists the annotations of the POS tagsets used in jiebaR:

8.1.7 Reminder

When we use segment() as a tokenization method in unnest_tokens(), it is very important to specify bylines = TRUE in worker(). This setting ensures that segment() takes a text-based vector as input and returns a list of word-based vectors of the same length as output.

NB: When bylines = FALSE, segment() returns a vector.
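The difference can be checked directly with class(). The sketch below uses a hypothetical example sentence and hypothetical object names.

```r
library(jiebaR)

x <- "這是一個測試的句子"  # hypothetical example sentence

seg_list <- worker(bylines = TRUE)   # one list element per input text
seg_vec  <- worker(bylines = FALSE)  # a single character vector

out_list <- segment(x, seg_list)
out_vec  <- segment(x, seg_vec)

class(out_list)  # "list"
class(out_vec)   # "character"
```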

## [[1]]
##  [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指民眾"  
##  [7] "黨"       "不"       "分區"     "被"       "提名"     "人"      
## [13] "蔡壁如"   "黃"       "瀞"       "瑩"       "在昨"     "6"       
## [19] "日"       "才"       "請辭"     "是"       "為領"     "年終獎金"
## [25] "台灣民眾" "黨"       "主席"     "台北"     "市長"     "柯文"    
## [31] "哲"       "7"        "日"       "受訪"     "時則"     "說"      
## [37] "都"       "是"       "按"       "流程"     "走"       "不要"    
## [43] "把"       "人家"     "想得"     "這麼"     "壞"
##  [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指民眾"  
##  [7] "黨"       "不"       "分區"     "被"       "提名"     "人"      
## [13] "蔡壁如"   "黃"       "瀞"       "瑩"       "在昨"     "6"       
## [19] "日"       "才"       "請辭"     "是"       "為領"     "年終獎金"
## [25] "台灣民眾" "黨"       "主席"     "台北"     "市長"     "柯文"    
## [31] "哲"       "7"        "日"       "受訪"     "時則"     "說"      
## [37] "都"       "是"       "按"       "流程"     "走"       "不要"    
## [43] "把"       "人家"     "想得"     "這麼"     "壞"
## [1] "list"
## [1] "character"

8.2 Chinese Text Analytics Pipeline

In Chapter ??, we discussed the workflow for processing English texts, as shown below:

For Chinese texts, the workflow is pretty much the same. The most important trick is in the tokenization step, i.e., unnest_tokens(): we need to specify our own tokenizer for the argument token = ... in unnest_tokens().

It is important to note that when we specify a self-defined token function, this function should take a character vector (i.e., a text-based vector) and return a list of character vectors (i.e., word-based vectors) of the same length.

In other words, when initializing the Chinese word segmenter, we need to specify the argument bylines = TRUE, i.e., worker(bylines = TRUE).

So based on our simple-corpus example above, we can create
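A minimal sketch of this step, assuming the dplyr and tidytext packages are installed. The names corpus_df, corpus_word, and seg_lines are hypothetical, and a short sentence stands in for the one-document corpus.

```r
library(dplyr)
library(tidytext)
library(jiebaR)

# The segmenter must return a list, hence bylines = TRUE
seg_lines <- worker(bylines = TRUE)

# A text-based tidy corpus: one row per document
corpus_df <- tibble(doc_id = 1, text = "都是按流程走,不要把人家想得這麼壞。")

# A word-based tidy corpus: one row per word token
corpus_word <- corpus_df %>%
  unnest_tokens(
    output = word,
    input  = text,
    token  = function(x) segment(x, jiebar = seg_lines)
  )

corpus_word
```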

In the following sections, we look at a few more case studies of Chinese text processing using the news articles collected from Apple News as our example corpus. The dataset is available in our course dropbox drive: demo_data/applenews10000.tar.gz.

8.3 Case Study 1: Word Frequency and Wordcloud

We follow the same steps as illustrated in the above flowchart ??:

  • create a text-based tidy corpus object apple_df (i.e., a tibble)
  • initialize a word segmenter using worker()
  • tokenize the corpus into a word-based tidy corpus object using unnest_tokens()

With a word-based corpus object, we can easily generate a word frequency list as well as a wordcloud to have a quick view of the word distribution in the corpus.
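The whole workflow, from a text-based corpus to a word frequency list, can be sketched on a toy corpus as follows. The two toy documents and the names toy_df, toy_word, and word_freq are hypothetical stand-ins for the Apple News data; the word cloud step is left out here.

```r
library(dplyr)
library(tidytext)
library(jiebaR)

# A toy text-based tidy corpus (stand-in for apple_df)
toy_df <- tibble(
  doc_id = 1:2,
  text   = c("這是一個測試的句子。", "這是另一個測試。")
)

# Initialize the segmenter; bylines = TRUE is required by unnest_tokens()
seg_lines <- worker(bylines = TRUE)

# Tokenize into a word-based tidy corpus
toy_word <- toy_df %>%
  unnest_tokens(word, text,
                token = function(x) segment(x, jiebar = seg_lines))

# Word frequency list, most frequent words first
word_freq <- toy_word %>%
  count(word, sort = TRUE)

word_freq
```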

8.4 Case Study 2: Patterns

In this case study, we look at a more complex example. In linguistic analysis, we often need to extract a particular pattern from the texts. In order to retrieve the target patterns with high accuracy, we often make use of the additional annotations provided by the corpus. The most frequently used information is the parts-of-speech tags of words. So here we demonstrate how to add POS tag information to our current tidy corpus design.

Important steps are as follows:

  • Create a self-defined tokenization function, which takes a text and returns the same text but with the POS-tags of all the words included.

We define the function chi_pos_tagger(), which takes a text as input and returns the same text with a POS tag appended to the end of each word.
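One possible way to write such a function is sketched below. It is not necessarily the original definition: the word_TAG format with space-separated tokens follows the output shown below, while the settings of the tagging segmenter (type = "tag", symbol = TRUE) and the name seg_tag are assumptions.

```r
library(jiebaR)

# A tagging segmenter that keeps punctuation (assumed settings)
seg_tag <- worker(type = "tag", symbol = TRUE)

# Take ONE text and return one string of space-separated word_TAG tokens
chi_pos_tagger <- function(x, tagger = seg_tag) {
  tagged <- segment(x, tagger)  # named vector: names are the POS tags
  paste(tagged, names(tagged), sep = "_", collapse = " ")
}

chi_pos_tagger("我是在測試一個句子")
```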

## [1] "綠黨_n 桃園市_x 議員_n 王浩宇_x 爆料_n ,_x 指民眾_x 黨_n 不_d 分區_n 被_p 提名_v 人_n 蔡壁如_x 、_x 黃_zg 瀞_x 瑩_zg ,_x 在昨_x (_x 6_x )_x 日_m 才_d 請辭_v 是_v 為領_x 年終獎金_n 。_x 台灣_x 民眾_x 黨_n 主席_n 、_x 台北_x 市長_x 柯文_nz 哲_n 7_x 日_m 受訪_v 時則_x 說_zg ,_x 都_d 是_v 按_p 流程_n 走_v ,_x 不要_df 把_p 人家_n 想得_x 這麼_x 壞_a 。_x"
## [1] "我_r 是_v 在_p 測試_vn 一個_x 句子_n"

  • Tokenize the text-based tidy corpus into a sentence-based one, POS-tag each sentence, and put the tagged version of the sentences in a new column.

In the above example, we adopt a very naive approach by treating any linguistic unit in-between the punctuation marks as a possible sentence-like unit. This can be controversial to many grammarians and syntacticians. However, in practice, this may not be a bad choice, as will become clear when we extract patterns.

For more information related to the Unicode range for the punctuation marks in CJK languages, please see this SO discussion thread.
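The naive splitting can be sketched with stringr as follows. The function name split_ipu() is hypothetical, and the Unicode punctuation class \p{P} is used here as one possible definition of a boundary, not necessarily the one used for the corpus.

```r
library(stringr)

# Split each text into inter-punctuation units (IPUs).
# Takes a character vector and returns a list of character vectors.
split_ipu <- function(x) {
  units <- str_split(x, "[\\p{P}\\s]+")   # break at punctuation/whitespace
  lapply(units, function(u) u[u != ""])   # drop empty strings at the edges
}

split_ipu("今天天氣很好,我們去公園散步。")
```

Because split_ipu() returns a list of the same length as its input, it has the shape that unnest_tokens() expects from a self-defined token function.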

After we tokenize our text-based tidy corpus into an inter-punctuation-unit-based (IPU) tidy corpus, we can make use of the words as well as their parts-of-speech tags to extract the target pattern we are interested in: 被 + ... constructions. The data retrieval process is now very straightforward: we only need to go through all the IPUs in the corpus object and see if our target pattern matches any of them.

In the following example, we:

  • subset IPUs with the target pattern \\b被_p\\b using str_detect()
  • extract the strings that match the target pattern \\b被_p\\s([^_]+_[^\\s]+\\s)*?[^_]+_v using str_extract() and add these strings to a new column using mutate()
  • extract the verb in the BEI construction using str_extract() and add these verbs to a new column using mutate()
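These three steps can be sketched on a toy IPU table as follows. The two tagged IPUs are taken from the tagged output above; the column names IPU_tag, PATTERN, and VERB and the lookahead regex used to isolate the verb are assumptions made for illustration.

```r
library(dplyr)
library(stringr)

# A toy IPU-based corpus with POS-tagged units
toy_ipu <- tibble(
  IPU_tag = c("指民眾_x 黨_n 不_d 分區_n 被_p 提名_v 人_n 蔡壁如_x",
              "都_d 是_v 按_p 流程_n 走_v")
)

bei <- toy_ipu %>%
  # keep only IPUs that contain 被 tagged as a preposition
  filter(str_detect(IPU_tag, "\\b被_p\\b")) %>%
  mutate(
    # shortest stretch from 被 up to the first verb-tagged word
    PATTERN = str_extract(IPU_tag, "\\b被_p\\s([^_]+_[^\\s]+\\s)*?[^_]+_v"),
    # the word right before the final _v tag of the matched stretch
    VERB = str_extract(PATTERN, "[^\\s_]+(?=_v$)")
  )

bei
```

Note that the pattern ends in _v without a boundary, so it also matches the beginning of longer tags such as _vn.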

Exercise 8.1 When you take a closer look at the resulting word cloud above, you would see the copular verb 是 showing up in the graph, which is counter to our native speaker intuition. How do you check the instances of these 是 tokens? After you examine these cases, what do you think may be the source of the problem?

Exercise 8.2 Please use apple_ipu as your tidy corpus and extract Chinese space particle constructions of the form ... 外/內/中. Usually a space particle construction like these consists of a landmark NP (LM) and the space particle (SP). For example, in 任期內, 任期 is the landmark NP and 內 is the space particle. In this exercise, we will naively assume that the word directly preceding the space particle is the head noun of the landmark NP. So please (a) extract all concordance lines with these space particles and (b) at the same time identify their respective SP and LM, as shown below.

Exercise 8.3 Following Exercise 8.2, please generate a frequency list of the LMs for each space particle. Show us the top 10 LMs of each space particle and arrange the frequencies of the LMs in descending order, as shown below.

Exercise 8.4 Following Exercise 8.3, for each space particle, please create a word cloud of its co-occurring LMs based on the top 200 LMs of each particle.

PS: The word frequencies in the word clouds shown below are on a log scale.


Exercise 8.5 In the output provided above in Exercise 8.3, the graph of 內 shows a few LMs that are counterintuitive to our native-speaker knowledge: for example, 出車, 的, 做, 到. Can you tell us why? What might the problems be? What did we do wrong in the previous processing?

8.5 Case Study 3: Lexical Bundles

With word boundaries, we can also analyze the recurrent multiword units in Chinese news. Here let’s take a look at recurrent four-grams. As we discussed in Chapter ??, a multiword unit can be defined based on at least two metrics:

  • the frequency of the whole multiword unit (i.e., frequency)
  • the number of texts where the multiword unit is observed (i.e., dispersion)

As the default tokenization in unnest_tokens() only works with English data, we start this task by defining our own token function ngram_chi() to extract Chinese n-grams.

This ngram_chi() takes ONE text (scalar) as an input, and returns a vector of n-grams. Most importantly, this function assumes that in the text string, each word token is delimited by a whitespace.
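One possible base-R definition is sketched below. The argument names n and delimiter and the choice to return an empty vector for texts shorter than n are assumptions; the calls at the end correspond to the outputs shown below.

```r
# Take ONE whitespace-delimited text and return a vector of its n-grams
ngram_chi <- function(text, n = 2, delimiter = "_") {
  tokens <- unlist(strsplit(text, "\\s+"))  # split at whitespace
  tokens <- tokens[tokens != ""]            # drop empty strings
  if (length(tokens) < n) {
    return(character(0))                    # text too short for any n-gram
  }
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = delimiter),
         character(1))
}

x <- "這 是 一個 測試 的 句子 。"
ngram_chi(x, n = 2)
ngram_chi(x, n = 4)
ngram_chi(x, n = 5, delimiter = " ")
```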

## [1] "這_是"     "是_一個"   "一個_測試" "測試_的"   "的_句子"   "句子_。"
## [1] "這_是_一個_測試"   "是_一個_測試_的"   "一個_測試_的_句子"
## [4] "測試_的_句子_。"
## [1] "這 是 一個 測試 的"   "是 一個 測試 的 句子" "一個 測試 的 句子 。"

We vectorize the function ngram_chi(). This step is important because in unnest_tokens() the self-defined token function should take a text-based vector as input and return a list of token-based vectors of the same length as the output (cf. Section 8.2).


Vectorized functions are a very useful feature of R, but programmers who are used to other languages often have trouble with this concept at first. A vectorized function works not just on a single value, but on a whole vector of values at the same time.

Our first version of ngram_chi() takes a single text as input and processes only that one text. However, we need it to process a vector of texts (i.e., multiple texts) in one call and return a list of n-gram vectors, one per text. Therefore, we use Vectorize() as a wrapper to vectorize our function and explicitly tell R that the argument text is vectorized, i.e., each value in the text vector is processed in the same way.
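A minimal sketch of the vectorization step. The compact ngram_chi() below is one possible definition, repeated only so that the example is self-contained; SIMPLIFY = FALSE and USE.NAMES = FALSE are choices made here so that the result is always a plain, unnamed list.

```r
# One possible scalar n-gram function (assumed definition)
ngram_chi <- function(text, n = 2, delimiter = "_") {
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens <- tokens[tokens != ""]
  if (length(tokens) < n) return(character(0))
  vapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = delimiter),
         character(1))
}

# Vectorize over the argument text: one list element per input text
vngram_chi <- Vectorize(ngram_chi,
                        vectorize.args = "text",
                        SIMPLIFY = FALSE,
                        USE.NAMES = FALSE)

texts <- c("這 是 一個 測試 的 句子 。", "這 是 另 一個 測試")
res <- vngram_chi(text = texts, n = 2)
res
```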


Now we can tokenize our corpus into n-grams using our own token function vngram_chi() and unnest_tokens(). In this case study, we demonstrate the analysis of four-grams in our Apple News corpus.

  • Because we need to calculate not only the frequencies of the n-grams but also their dispersions, we begin by creating a sentence ID for the IPUs of each article.
  • Then we remove all the POS tags because n-gram extraction does not need the POS tag information.
  • Finally, these cleaned versions of IPU, stored in the new column IPU_word, are used as input for unnest_tokens() and we specify our own token function vngram_chi() to tokenize IPU_word into four-grams.

Now that we have the four-grams-based tidy corpus object, we can compute their token frequencies and document frequencies in the corpus using the normal data manipulation tricks.
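The steps above, from IPU IDs to frequency and dispersion counts, can be sketched on a toy corpus as follows. The toy data, the column names doc_id, IPU_tag, IPU_id, and fourgram, and the compact ngram_chi() definition are assumptions made for illustration; the POS tags in the toy strings are illustrative only.

```r
library(dplyr)
library(stringr)
library(tidytext)

# One possible scalar n-gram function and its vectorized version (assumed)
ngram_chi <- function(text, n = 2, delimiter = "_") {
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens <- tokens[tokens != ""]
  if (length(tokens) < n) return(character(0))
  vapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = delimiter),
         character(1))
}
vngram_chi <- Vectorize(ngram_chi, vectorize.args = "text",
                        SIMPLIFY = FALSE, USE.NAMES = FALSE)

# A toy IPU-based corpus: two IPUs in document 1, one in document 2
toy_ipu <- tibble(
  doc_id  = c(1, 1, 2),
  IPU_tag = c("這_r 是_v 一個_m 測試_vn 的_uj 句子_n",
              "這_r 是_v 一個_m 測試_vn",
              "這_r 是_v 一個_m 測試_vn 的_uj 文件_n")
)

toy_fourgram <- toy_ipu %>%
  group_by(doc_id) %>%
  mutate(IPU_id = row_number()) %>%                           # IPU ID per doc
  ungroup() %>%
  mutate(IPU_word = str_replace_all(IPU_tag, "_[^\\s]+", "")) %>%  # drop tags
  unnest_tokens(fourgram, IPU_word,
                token = function(x) vngram_chi(text = x, n = 4))

# Token frequency and dispersion (number of documents) of each four-gram
fourgram_freq <- toy_fourgram %>%
  group_by(fourgram) %>%
  summarize(freq = n(), dispersion = n_distinct(doc_id)) %>%
  arrange(desc(freq), desc(dispersion))

fourgram_freq
```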

Please take a look at the four-grams, both arranged by frequency and dispersion: